## Dataset

This Harvard Dataverse entry includes the following data resources (use UTF-8 encoding when reading the CSV files into R or Python):

## CSV (text and metadata) files: kremlin_mid_en_ru_final_csvs.zip

A zipped file containing four analysis-ready, speech-level CSV tables (one per source-language corpus) that contain the full set of scraped metadata and the final modeling outputs:

-kremlin_english.csv: Kremlin corpus in English (original English pages).
-kremlin_russian.csv: Kremlin corpus in Russian, with the translated English text fields (e.g., full_text_english) used for topic modeling while preserving the original Russian content.
-mid_english.csv: MID.ru corpus in English (original English pages).
-mid_russian.csv: MID.ru corpus in Russian, with translated English text fields used for topic modeling while preserving the original Russian content.
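
As a minimal sketch of the UTF-8 reading step in Python, the helper below pulls one corpus CSV straight out of the zipped file using only the standard library. The function name `read_corpus_rows` is hypothetical, and the internal folder layout of the archive is an assumption, so the helper matches on the CSV's base name rather than a fixed path.

```python
import csv
import io
import zipfile

# Full speech texts can be long, so raise the csv module's per-field size cap.
csv.field_size_limit(2**31 - 1)


def read_corpus_rows(zip_path, csv_name):
    """Return the rows of one corpus CSV inside the archive as a list of dicts."""
    with zipfile.ZipFile(zip_path) as zf:
        # Assumption: the CSV may sit inside a subfolder, so match on the base name.
        matches = [n for n in zf.namelist() if n.split("/")[-1] == csv_name]
        if not matches:
            raise FileNotFoundError(f"{csv_name} not found in {zip_path}")
        with zf.open(matches[0]) as raw:
            # Decode explicitly as UTF-8, as this deposit recommends.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

For example, `read_corpus_rows("kremlin_mid_en_ru_final_csvs.zip", "kremlin_russian.csv")` would return one dictionary per speech, keyed by the column names in the CSV header.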


## Image files:

All scraped images referenced by the CSVs above are distributed as four zipped archives, one per corpus:

-kremlin_english_images.zip
-kremlin_russian_images.zip
-mid_english_images.zip
-mid_russian_images.zip

Within each corpus-specific archive, image files are stored in standard web formats (predominantly .jpg, with occasional .png). The stored_image_filepaths column in each CSV provides the authoritative linkage from speech records to image files: each cell contains a list of image paths that are relative to the corresponding corpus image root directory. To locate an image, extract the relevant corpus archive and concatenate the corpus image root directory with the relative path listed in stored_image_filepaths. Image captions, when available, are provided in the parallel list-valued image_captions field.
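
The path lookup described above might be sketched in Python as follows. This assumes that each list-valued cell is serialized as a Python- or JSON-style list literal such as `['a.jpg', 'b.png']`; the deposit does not state the exact serialization, so check a few cells before relying on it. All function names here are hypothetical.

```python
import ast
from pathlib import Path


def parse_list_cell(cell):
    """Parse a list-valued CSV cell into a list of strings."""
    if cell is None or not cell.strip():
        return []
    try:
        # Assumption: the cell holds a list literal, e.g. "['a.jpg', 'b.png']".
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Not a list literal: fall back to treating the cell as a single entry.
        return [cell.strip()]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def resolve_image_paths(cell, image_root):
    """Join each relative path in stored_image_filepaths to the corpus image root."""
    root = Path(image_root)
    return [root / rel for rel in parse_list_cell(cell)]


def images_with_captions(paths_cell, captions_cell, image_root):
    """Pair each image path with its caption by position (None if no caption)."""
    paths = resolve_image_paths(paths_cell, image_root)
    captions = parse_list_cell(captions_cell)
    # Pad with None when fewer captions than images are listed.
    captions += [None] * (len(paths) - len(captions))
    return list(zip(paths, captions))
```

Here `image_root` is whichever directory the relevant corpus archive (for example, kremlin_english_images.zip) was extracted into.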


## Auxiliary materials distributed with the data deposit: kremlin_mid_en_ru_auxiliary_files.zip 

Alongside the four CSVs and four image archives, this deposit includes a compact set of auxiliary resources to support inspection and reuse. The following materials are all located in the zipped folder mentioned immediately above:

-topic_summaries_html.zip: HTML topic-summary files (one per learned topic per corpus) for qualitative inspection of keywords, representative speeches, and representative images.
-text_and_image_topic_probability_files.zip: long-format document-topic and image-topic probability tables exported from the modeling pipeline. The contents include:

text_probability_files:
-kremlin_english_text_topic_probs.csv
-kremlin_russian_text_topic_probs.csv
-mid_english_text_topic_probs.csv
-mid_russian_text_topic_probs.csv

image_probability_files:
-kremlin_english_image_topic_probs.csv
-kremlin_russian_image_topic_probs.csv
-mid_english_image_topic_probs.csv
-mid_russian_image_topic_probs.csv

Together, these materials provide (i) analysis-ready speech-level tables, (ii) the complete set of linked image files, and (iii) optional topic-level and probability-level outputs that support robustness checks, alternative topic assignment strategies, and qualitative validation. 
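
As one illustration of an alternative topic assignment strategy, the sketch below assigns each document (or image) its highest-probability topic from a long-format probability table. The column names in these files are not documented here, so the function takes them as arguments; read them from the CSV header before use. The function name and the example column names are hypothetical.

```python
def top_topic_per_document(rows, id_col, topic_col, prob_col):
    """Map each document ID to its (topic, probability) pair with the highest probability.

    `rows` is an iterable of dicts, one per (document, topic) pair, as produced
    by csv.DictReader on a long-format probability table.
    """
    best = {}
    for row in rows:
        doc = row[id_col]
        prob = float(row[prob_col])  # CSV values arrive as strings
        # Keep the first-seen topic on ties; replace only on a strictly higher value.
        if doc not in best or prob > best[doc][1]:
            best[doc] = (row[topic_col], prob)
    return best


# Toy rows with made-up column names, standing in for a real probability file.
example_rows = [
    {"doc_id": "s1", "topic": "3", "probability": "0.20"},
    {"doc_id": "s1", "topic": "7", "probability": "0.65"},
    {"doc_id": "s2", "topic": "3", "probability": "0.90"},
]
assignments = top_topic_per_document(example_rows, "doc_id", "topic", "probability")
```

Other strategies, such as keeping every topic above a probability threshold, follow the same pattern with a different selection rule.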


## Additional resources

Further details on these datasets, variables, and the methods used to produce these data can be found in: Blinova, Daria, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Sunita Chandrasekaran, and Benjamin E. Bagozzi. 2026. "Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches."

Further details on the code and intermediate inputs needed for producing these data can be found at: https://github.com/bagozzib/Russian-Speech-Text-and-Images

For questions, please contact Benjamin E. Bagozzi at bagozzib@udel.edu